1 MVP

You are given a set of data on housing sale prices for the last few years in King County (near Seattle) between May 2014 and May 2015.


We want you to build an explanatory model for the price of housing in King County, i.e. an interpretable model in which the included variables are statistically justifiable.

The variable definitions are:

id - Unique ID for each home sold
date - Date of the home sale
price - Price of each home sold
bedrooms - Number of bedrooms
bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower
sqft_living - Square footage of the apartments interior living space
sqft_lot - Square footage of the land space
floors - Number of floors
waterfront - A dummy variable for whether the apartment was overlooking the waterfront or not
view - An index from 0 to 4 of how good the view of the property was
condition - An index from 1 to 5 on the condition of the apartment
grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design
sqft_above - The square footage of the interior housing space that is above ground level
sqft_basement - The square footage of the interior housing space that is below ground level
yr_built - The year the house was initially built
yr_renovated - The year of the house’s last renovation
zipcode - What zipcode area the house is in
lat - Lattitude
long - Longitude
sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors


2 Question 1

Tidy up the data ready for regression:

* You might like to think about removing some or all of `date`, `id`, `sqft_living15`, `sqft_lot15` and `zipcode` (`lat` and `long` provide a better measure of location in any event).
* Have a think about how to treat `waterfront`. Should we convert its type?
* We converted `yr_renovated` into a `renovated` logical variable, indicating whether the property had ever been renovated. You may wish to do the same.
* Have a think about how to treat `view`, `condition` and `grade`? Are they interval or categorical ordinal data types?
library(GGally)
library(modelr)
library(tidyverse)
houses <- read_csv("data/kc_house_data.csv")
## Rows: 21613 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (1): id
## dbl  (19): price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterf...
## dttm  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(houses)
## Rows: 21,613
## Columns: 21
## $ id            <chr> "7129300520", "6414100192", "5631500400", "2487200875", …
## $ date          <dttm> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015-02…
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
## $ bedrooms      <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_living   <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
## $ sqft_lot      <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ view          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition     <dbl> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade         <dbl> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above    <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built      <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ yr_renovated  <dbl> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ zipcode       <dbl> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, …
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ sqft_living15 <dbl> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
## $ sqft_lot15    <dbl> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …
# tidy up data. In particular treat condition and grade as factor, as they are
# ordinal categorical
houses_tidy <- houses %>%
  select(-c("id", "date", "sqft_living15", "sqft_lot15", "zipcode")) %>%
  mutate(waterfront = as.logical(waterfront)) %>%
  mutate(renovated = yr_renovated != 0) %>%
  select(-"yr_renovated") %>%
  mutate(view = as_factor(view)) %>% 
  mutate(condition = as_factor(condition)) %>%
  mutate(grade = as_factor(grade))

glimpse(houses_tidy)
## Rows: 21,613
## Columns: 16
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
## $ bedrooms      <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_living   <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
## $ sqft_lot      <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ view          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition     <fct> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade         <fct> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above    <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built      <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ renovated     <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…


3 Question 2

Check for aliased variables using the alias() function (this takes in a formula object and a data set). [Hint - formula price ~ . says ‘price varying with all predictors’, this is a suitable input to alias()]. Remove variables that lead to an alias. Check the ‘Elements of multiple regression’ lesson for a dropdown containing further information on finding aliased variables in a dataset.

Solution

# Alias is useful to check if we have aliased variables, i.e. one or more
# variables that can be computed from other variables
alias(price ~ ., data = houses_tidy)
## Model :
## price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
##     waterfront + view + condition + grade + sqft_above + sqft_basement + 
##     yr_built + lat + long + renovated
## 
## Complete :
##               (Intercept) bedrooms bathrooms sqft_living sqft_lot floors
## sqft_basement  0           0        0         1           0        0    
##               waterfrontTRUE view1 view2 view3 view4 condition2 condition3
## sqft_basement  0              0     0     0     0     0          0        
##               condition4 condition5 grade3 grade4 grade5 grade6 grade7 grade8
## sqft_basement  0          0          0      0      0      0      0      0    
##               grade9 grade10 grade11 grade12 grade13 sqft_above yr_built lat
## sqft_basement  0      0       0       0       0      -1          0        0 
##               long renovatedTRUE
## sqft_basement  0    0
# seems that sqft_basement can be computed from sqft_living - sqft_above.
# let's drop sqft_living leaving just the two contributions sqft_basement and 
# sqft_above
houses_tidy <- houses_tidy %>%
  select(-"sqft_living")

glimpse(houses_tidy)
## Rows: 21,613
## Columns: 15
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
## $ bedrooms      <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_lot      <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ view          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition     <fct> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade         <fct> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above    <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built      <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ renovated     <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…


4 Question 3

Systematically build a regression model containing up to four main effects (remember, a main effect is just a single predictor with coefficient), testing the regression diagnostics as you go

* splitting datasets into numeric and non-numeric columns might help `ggpairs()` run in manageable time, although you will need to add either a `price` or `resid` column to the non-numeric dataframe in order to see its correlations with the non-numeric predictors. You can do it in the following way:
houses_tidy_numeric <- houses_tidy %>%
  select_if(is.numeric)

houses_tidy_nonnumeric <- houses_tidy %>%
  select_if(function(x) !is.numeric(x))

houses_tidy_nonnumeric$price <- houses_tidy$price

Solution

ggpairs(houses_tidy_numeric)

ggpairs(houses_tidy_nonnumeric)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You can also subset your ggpairs plot doing the following:


ggpairs(houses_tidy_numeric, columns = c(1, 2, 3, 4))

and the same in subsequent rounds of predictor selection with the resid column.

Remember, if you are not sure whether including a categorical predictor is statistically justified, run an anova() test passing in the models with- and without the categorical predictor and check the p-value of the test.

houses_tidy_numeric <- houses_tidy %>%
  select_if(is.numeric)

houses_tidy_nonnumeric <- houses_tidy %>%
  select_if(function(x) !is.numeric(x))

houses_tidy_nonnumeric$price <- houses_tidy$price

ggpairs(houses_tidy_numeric)

ggpairs(houses_tidy_nonnumeric)

Correlation of sqft_above with price looks pretty promising, but split of price by grade and waterfront also look decent.

houses_tidy %>%
  ggplot(aes(x = grade, y = price)) +
  geom_boxplot()

houses_tidy %>%
  ggplot(aes(x = waterfront, y = price)) +
  geom_boxplot()

Start building the models:

# model with price and sqft above
mod1a <- lm(price ~ sqft_above, data = houses_tidy)
summary(mod1a)
## 
## Call:
## lm(formula = price ~ sqft_above, data = houses_tidy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -913132 -165624  -41468  109327 5339232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59953.2     4729.8   12.68   <2e-16 ***
## sqft_above     268.5        2.4  111.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 292200 on 21611 degrees of freedom
## Multiple R-squared:  0.3667, Adjusted R-squared:  0.3667 
## F-statistic: 1.251e+04 on 1 and 21611 DF,  p-value: < 2.2e-16
mod1b <- lm(price ~ grade, data = houses_tidy)
summary(mod1b)
## 
## Call:
## lm(formula = price ~ grade, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1929615  -135853   -35090    89080  5565658 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   142000     254499   0.558 0.576878    
## grade3         63667     293870   0.217 0.828484    
## grade4         72381     258849   0.280 0.779767    
## grade5        106524     255024   0.418 0.676169    
## grade6        159920     254561   0.628 0.529868    
## grade7        260590     254513   1.024 0.305904    
## grade8        400853     254520   1.575 0.115285    
## grade9        631513     254547   2.481 0.013112 *  
## grade10       929771     254611   3.652 0.000261 ***
## grade11      1354842     254817   5.317 1.07e-07 ***
## grade12      2049222     255909   8.008 1.23e-15 ***
## grade13      3567615     264106  13.508  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 254500 on 21601 degrees of freedom
## Multiple R-squared:  0.5197, Adjusted R-squared:  0.5195 
## F-statistic:  2125 on 11 and 21601 DF,  p-value: < 2.2e-16
mod1c <- lm(price ~ waterfront, data = houses_tidy)
summary(mod1c)
## 
## Call:
## lm(formula = price ~ waterfront, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1376876  -211564   -81564   108436  7168436 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      531564       2416  220.00   <2e-16 ***
## waterfrontTRUE  1130312      27822   40.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 353900 on 21611 degrees of freedom
## Multiple R-squared:  0.07095,    Adjusted R-squared:  0.07091 
## F-statistic:  1650 on 1 and 21611 DF,  p-value: < 2.2e-16

Grade looks the most promising, but some of the grade level coefficients are insignificant.

From the F-test at the bottom of the regression output tests against the null model (i.e. intercept only). But, if we want, we can replicate this using a separate anova. For this, our null model would be to regress price on intercept only.

You can do this in the following way:

null_model <- lm(price ~ 1, data = houses_tidy)
grade_model <- lm(price ~ grade, data = houses_tidy)
anova(null_model, grade_model)
# grade is significant, let's keep it. Now plot diagnostics
par(mfrow = c(2, 2))
plot(mod1b)
## Warning: not plotting observations with leverage one:
##   19453

houses_resid <- houses_tidy %>%
  add_residuals(mod1b) %>%
  select(-c("price", "grade"))

houses_resid_numeric <- houses_resid %>%
  select_if(is.numeric)

houses_resid_nonnumeric <- houses_resid %>%
  select_if(function(x) !is.numeric(x))

houses_resid_nonnumeric$resid <- houses_resid$resid
ggpairs(houses_resid_numeric)

ggpairs(houses_resid_nonnumeric)

Our predictor lat has highest correlation with residuals, but waterfront and view look pretty promising. Try all three…

mod2a <- lm(price ~ grade + lat, data = houses_tidy)
summary(mod2a)
## 
## Call:
## lm(formula = price ~ grade + lat, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1863783  -112379   -28181    67454  5533910 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -29949769     610666 -49.044  < 2e-16 ***
## grade3         187923     276126   0.681 0.496151    
## grade4          95367     243212   0.392 0.694978    
## grade5         126354     239618   0.527 0.597980    
## grade6         158453     239183   0.662 0.507673    
## grade7         246275     239138   1.030 0.303093    
## grade8         378635     239144   1.583 0.113369    
## grade9         603010     239170   2.521 0.011701 *  
## grade10        891752     239230   3.728 0.000194 ***
## grade11       1311124     239425   5.476  4.4e-08 ***
## grade12       2007093     240450   8.347  < 2e-16 ***
## grade13       3502099     248154  14.113  < 2e-16 ***
## lat            633100      11822  53.554  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 239100 on 21600 degrees of freedom
## Multiple R-squared:  0.576,  Adjusted R-squared:  0.5758 
## F-statistic:  2445 on 12 and 21600 DF,  p-value: < 2.2e-16
mod2b <- lm(price ~ grade + waterfront, data = houses_tidy)
summary(mod2b)
## 
## Call:
## lm(formula = price ~ grade + waterfront, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1929615  -132969   -32393    91481  4778940 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      142000     244366   0.581   0.5612    
## grade3            63667     282169   0.226   0.8215    
## grade4            72381     248543   0.291   0.7709    
## grade5            92834     244870   0.379   0.7046    
## grade6           155043     244426   0.634   0.5259    
## grade7           258469     244379   1.058   0.2902    
## grade8           395393     244386   1.618   0.1057    
## grade9           623595     244413   2.551   0.0107 *  
## grade10          909321     244474   3.719   0.0002 ***
## grade11         1313326     244674   5.368 8.06e-08 ***
## grade12         1947993     245731   7.927 2.35e-15 ***
## grade13         3567615     253590  14.068  < 2e-16 ***
## waterfrontTRUE   828234      19363  42.773  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 244400 on 21600 degrees of freedom
## Multiple R-squared:  0.5572, Adjusted R-squared:  0.557 
## F-statistic:  2265 on 12 and 21600 DF,  p-value: < 2.2e-16
mod2c <- lm(price ~ grade + view, data = houses_tidy)
summary(mod2c)
## 
## Call:
## lm(formula = price ~ grade + view, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1724708  -128875   -30577    91175  5534514 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   142000     240508   0.590 0.554918    
## grade3         63667     277715   0.229 0.818675    
## grade4         53178     244622   0.217 0.827909    
## grade5         90541     241005   0.376 0.707157    
## grade6        150434     240568   0.625 0.531762    
## grade7        250923     240522   1.043 0.296846    
## grade8        376825     240529   1.567 0.117211    
## grade9        588427     240556   2.446 0.014449 *  
## grade10       861368     240619   3.580 0.000345 ***
## grade11      1246776     240821   5.177 2.27e-07 ***
## grade12      1853274     241873   7.662 1.90e-14 ***
## grade13      3362708     249623  13.471  < 2e-16 ***
## view1         208846      13343  15.652  < 2e-16 ***
## view2         139209       8023  17.351  < 2e-16 ***
## view3         228007      10937  20.847  < 2e-16 ***
## view4         596719      13855  43.070  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 240500 on 21597 degrees of freedom
## Multiple R-squared:  0.5711, Adjusted R-squared:  0.5708 
## F-statistic:  1917 on 15 and 21597 DF,  p-value: < 2.2e-16
# lat is significant and has a slightly higher r^2, let's keep model2a
par(mfrow = c(2, 2))
plot(mod2a)
## Warning: not plotting observations with leverage one:
##   19453

houses_resid <- houses_tidy %>%
  add_residuals(mod2a) %>%
  select(-c("price", "grade", "lat"))

houses_resid_numeric <- houses_resid %>%
  select_if(is.numeric)

houses_resid_nonnumeric <- houses_resid %>%
  select_if(function(x) !is.numeric(x))

houses_resid_nonnumeric$resid <- houses_resid$resid
ggpairs(houses_resid_numeric)

ggpairs(houses_resid_nonnumeric)

Now sqft_basement has strongest correlation with residuals, but also compare against model with waterfront and view.

mod3a <- lm(price ~ grade + lat + sqft_basement, data = houses_tidy)
summary(mod3a)
## 
## Call:
## lm(formula = price ~ grade + lat + sqft_basement, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1642900  -110556   -22563    70920  5247787 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.775e+07  5.872e+05 -47.259  < 2e-16 ***
## grade3         1.788e+05  2.645e+05   0.676 0.499028    
## grade4         9.259e+04  2.330e+05   0.397 0.691090    
## grade5         1.188e+05  2.296e+05   0.517 0.604927    
## grade6         1.391e+05  2.291e+05   0.607 0.543956    
## grade7         2.028e+05  2.291e+05   0.885 0.376173    
## grade8         3.299e+05  2.291e+05   1.440 0.149897    
## grade9         5.553e+05  2.291e+05   2.424 0.015379 *  
## grade10        8.298e+05  2.292e+05   3.620 0.000295 ***
## grade11        1.228e+06  2.294e+05   5.355 8.66e-08 ***
## grade12        1.880e+06  2.304e+05   8.159 3.55e-16 ***
## grade13        3.281e+06  2.378e+05  13.799  < 2e-16 ***
## lat            5.868e+05  1.137e+04  51.589  < 2e-16 ***
## sqft_basement  1.587e+02  3.607e+00  43.992  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 229100 on 21599 degrees of freedom
## Multiple R-squared:  0.6109, Adjusted R-squared:  0.6106 
## F-statistic:  2608 on 13 and 21599 DF,  p-value: < 2.2e-16
mod3b <- lm(price ~ grade + lat + waterfront, data = houses_tidy)
summary(mod3b)
## 
## Call:
## lm(formula = price ~ grade + lat + waterfront, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1862555  -109244   -24781    70145  4724767 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -30511083     581588 -52.462  < 2e-16 ***
## grade3            190241     262923   0.724 0.469343    
## grade4             95796     231583   0.414 0.679130    
## grade5            112654     228161   0.494 0.621488    
## grade6            153414     227746   0.674 0.500562    
## grade7            243828     227703   1.071 0.284264    
## grade8            372609     227709   1.636 0.101783    
## grade9            594340     227734   2.610 0.009066 ** 
## grade10           870026     227792   3.819 0.000134 ***
## grade11          1267641     227979   5.560 2.72e-08 ***
## grade12          1902270     228964   8.308  < 2e-16 ***
## grade13          3500877     236288  14.816  < 2e-16 ***
## lat               644910      11259  57.278  < 2e-16 ***
## waterfrontTRUE    851219      18046  47.168  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 227700 on 21599 degrees of freedom
## Multiple R-squared:  0.6156, Adjusted R-squared:  0.6154 
## F-statistic:  2661 on 13 and 21599 DF,  p-value: < 2.2e-16
mod3c <- lm(price ~ grade + lat + view, data = houses_tidy)
summary(mod3c)
## 
## Call:
## lm(formula = price ~ grade + lat + view, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1657374  -105914   -21840    69071  5500673 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -30312052     571751 -53.016  < 2e-16 ***
## grade3         189419     258409   0.733 0.463554    
## grade4          76878     227609   0.338 0.735544    
## grade5         110481     224243   0.493 0.622243    
## grade6         148798     223836   0.665 0.506208    
## grade7         236266     223794   1.056 0.291103    
## grade8         353834     223801   1.581 0.113887    
## grade9         558576     223826   2.496 0.012583 *  
## grade10        821334     223885   3.669 0.000245 ***
## grade11       1200200     224073   5.356 8.58e-08 ***
## grade12       1807909     225052   8.033 9.97e-16 ***
## grade13       3292412     232265  14.175  < 2e-16 ***
## lat            640722      11069  57.882  < 2e-16 ***
## view1          200472      12416  16.147  < 2e-16 ***
## view2          143253       7465  19.189  < 2e-16 ***
## view3          245505      10181  24.114  < 2e-16 ***
## view4          598302      12891  46.412  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 223800 on 21596 degrees of freedom
## Multiple R-squared:  0.6287, Adjusted R-squared:  0.6285 
## F-statistic:  2286 on 16 and 21596 DF,  p-value: < 2.2e-16
# view model is best, keep mod3c
par(mfrow = c(2, 2))
plot(mod3c)
## Warning: not plotting observations with leverage one:
##   19453

houses_resid <- houses_tidy %>%
  add_residuals(mod3c) %>%
  select(-c("price", "grade", "lat", "view"))

houses_resid_numeric <- houses_resid %>%
  select_if(is.numeric)

houses_resid_nonnumeric <- houses_resid %>%
  select_if(function(x) !is.numeric(x))

houses_resid_nonnumeric$resid <- houses_resid$resid
ggpairs(houses_resid_numeric)

ggpairs(houses_resid_nonnumeric)

sqft_basement has highest correlation with residuals. Let’s test against all remaining categorical predictors:

mod4a <- lm(price ~ grade + lat + view + sqft_basement, data = houses_tidy)
summary(mod4a)
## 
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1721908  -104563   -19475    70497  5300373 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.856e+07  5.596e+05 -51.038  < 2e-16 ***
## grade3         1.822e+05  2.518e+05   0.723 0.469421    
## grade4         7.896e+04  2.218e+05   0.356 0.721892    
## grade5         1.072e+05  2.185e+05   0.490 0.623801    
## grade6         1.359e+05  2.182e+05   0.623 0.533255    
## grade7         2.054e+05  2.181e+05   0.942 0.346393    
## grade8         3.218e+05  2.181e+05   1.475 0.140136    
## grade9         5.307e+05  2.181e+05   2.433 0.014993 *  
## grade10        7.864e+05  2.182e+05   3.604 0.000314 ***
## grade11        1.157e+06  2.184e+05   5.296 1.19e-07 ***
## grade12        1.739e+06  2.193e+05   7.930 2.30e-15 ***
## grade13        3.157e+06  2.264e+05  13.943  < 2e-16 ***
## lat            6.039e+05  1.084e+04  55.695  < 2e-16 ***
## view1          1.566e+05  1.217e+04  12.870  < 2e-16 ***
## view2          1.079e+05  7.351e+03  14.677  < 2e-16 ***
## view3          1.898e+05  1.006e+04  18.869  < 2e-16 ***
## view4          5.349e+05  1.270e+04  42.104  < 2e-16 ***
## sqft_basement  1.202e+02  3.557e+00  33.787  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 218100 on 21595 degrees of freedom
## Multiple R-squared:  0.6474, Adjusted R-squared:  0.6471 
## F-statistic:  2332 on 17 and 21595 DF,  p-value: < 2.2e-16
mod4b <- lm(price ~ grade + lat + view + waterfront, data = houses_tidy)
summary(mod4b)
## 
## Call:
## lm(formula = price ~ grade + lat + view + waterfront, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1711573  -105554   -21218    69955  4959192 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -30635387     563469 -54.369  < 2e-16 ***
## grade3            190754     254602   0.749 0.453729    
## grade4             77432     224255   0.345 0.729882    
## grade5            105683     220940   0.478 0.632416    
## grade6            147178     220538   0.667 0.504551    
## grade7            235834     220496   1.070 0.284830    
## grade8            353620     220503   1.604 0.108796    
## grade9            560970     220528   2.544 0.010974 *  
## grade10           822756     220586   3.730 0.000192 ***
## grade11          1195804     220772   5.416 6.14e-08 ***
## grade12          1800332     221736   8.119 4.94e-16 ***
## grade13          3349896     228854  14.638  < 2e-16 ***
## lat               647525      10910  59.354  < 2e-16 ***
## view1             198491      12233  16.226  < 2e-16 ***
## view2             138307       7358  18.797  < 2e-16 ***
## view3             224548      10064  22.311  < 2e-16 ***
## view4             365071      15646  23.334  < 2e-16 ***
## waterfrontTRUE    550012      21544  25.529  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 220500 on 21595 degrees of freedom
## Multiple R-squared:  0.6396, Adjusted R-squared:  0.6393 
## F-statistic:  2254 on 17 and 21595 DF,  p-value: < 2.2e-16
mod4c <- lm(price ~ grade + lat + view + condition, data = houses_tidy)
summary(mod4c)
## 
## Call:
## lm(formula = price ~ grade + lat + view + condition, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1653876  -105002   -18772    71134  5519169 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -30235579     562306 -53.771  < 2e-16 ***
## grade3         131612     257186   0.512 0.608841    
## grade4          46792     227085   0.206 0.836750    
## grade5          60902     223799   0.272 0.785528    
## grade6         102319     223641   0.458 0.647306    
## grade7         195346     223637   0.873 0.382404    
## grade8         323199     223648   1.445 0.148438    
## grade9         534908     223677   2.391 0.016791 *  
## grade10        800844     223736   3.579 0.000345 ***
## grade11       1184526     223920   5.290 1.24e-07 ***
## grade12       1794154     224872   7.979 1.55e-15 ***
## grade13       3283753     231841  14.164  < 2e-16 ***
## lat            639114      10889  58.694  < 2e-16 ***
## view1          189011      12204  15.488  < 2e-16 ***
## view2          132179       7344  17.999  < 2e-16 ***
## view3          232858      10012  23.258  < 2e-16 ***
## view4          577160      12688  45.488  < 2e-16 ***
## condition2      15201      44266   0.343 0.731306    
## condition3       8443      41173   0.205 0.837521    
## condition4      71904      41207   1.745 0.081006 .  
## condition5     148832      41433   3.592 0.000329 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219800 on 21592 degrees of freedom
## Multiple R-squared:  0.6418, Adjusted R-squared:  0.6415 
## F-statistic:  1935 on 20 and 21592 DF,  p-value: < 2.2e-16
mod4d <- lm(price ~ grade + lat + view + renovated, data = houses_tidy)
summary(mod4d)
## 
## Call:
## lm(formula = price ~ grade + lat + view + renovated, data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1637888  -105444   -20225    70144  5353486 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -29993826     566490 -52.947  < 2e-16 ***
## grade3           188105     255935   0.735 0.462365    
## grade4            72163     225430   0.320 0.748886    
## grade5           107085     222097   0.482 0.629702    
## grade6           141858     221693   0.640 0.522256    
## grade7           231058     221652   1.042 0.297221    
## grade8           348367     221658   1.572 0.116048    
## grade9           552595     221684   2.493 0.012684 *  
## grade10          817843     221742   3.688 0.000226 ***
## grade11         1199479     221928   5.405 6.56e-08 ***
## grade12         1811114     222898   8.125 4.70e-16 ***
## grade13         3265820     230045  14.196  < 2e-16 ***
## lat              634027      10968  57.805  < 2e-16 ***
## view1            190965      12306  15.519  < 2e-16 ***
## view2            137231       7400  18.546  < 2e-16 ***
## view3            233695      10100  23.138  < 2e-16 ***
## view4            575708      12815  44.924  < 2e-16 ***
## renovatedTRUE    154728       7545  20.506  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 221600 on 21595 degrees of freedom
## Multiple R-squared:  0.6358, Adjusted R-squared:  0.6355 
## F-statistic:  2218 on 17 and 21595 DF,  p-value: < 2.2e-16
# looks like model with sqft_basement is best, keep mod4a
par(mfrow = c(2, 2))
plot(mod4a)
## Warning: not plotting observations with leverage one:
##   19453

houses_resid <- houses_tidy %>%
  add_residuals(mod4a) %>%
  select(- price)

Our final model in terms of main effects is: price ~ grade + lat + view + sqft_basement


5 Extensions

  • Consider possible interactions between your four main effect predictors and test their effect upon \(r^2\). Choose your best candidate interaction and visualise its effect.

  • Calculate the relative importance of predictors from your best \(4\)-predictor model (i.e. the model without an interaction). Which predictor affects price most strongly?

Solution

Now, for interactions, have six possibilities that obey principle of strong hierarchy (i.e. consider including an interaction only if its main effects are already present in the model)

mod5a <- lm(price ~ grade + lat + view + sqft_basement + grade:lat, data = houses_tidy)
summary(mod5a)
## 
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + grade:lat, 
##     data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1614350  -100929   -21140    67905  5265776 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.573e+07  4.389e+07  -0.358   0.7200    
## grade3         2.660e+06  6.068e+07   0.044   0.9650    
## grade4         8.637e+06  4.759e+07   0.181   0.8560    
## grade5         2.062e+06  4.422e+07   0.047   0.9628    
## grade6        -3.792e+06  4.393e+07  -0.086   0.9312    
## grade7        -8.844e+06  4.390e+07  -0.201   0.8403    
## grade8        -1.241e+07  4.390e+07  -0.283   0.7773    
## grade9        -2.805e+07  4.392e+07  -0.639   0.5230    
## grade10       -4.186e+07  4.399e+07  -0.952   0.3414    
## grade11       -4.276e+07  4.430e+07  -0.965   0.3345    
## grade12       -1.069e+08  4.593e+07  -2.327   0.0200 *  
## grade13        3.184e+06  2.441e+05  13.040   <2e-16 ***
## lat            3.340e+05  9.234e+05   0.362   0.7176    
## view1          1.559e+05  1.208e+04  12.906   <2e-16 ***
## view2          1.095e+05  7.300e+03  14.999   <2e-16 ***
## view3          1.945e+05  9.992e+03  19.465   <2e-16 ***
## view4          5.322e+05  1.262e+04  42.157   <2e-16 ***
## sqft_basement  1.206e+02  3.532e+00  34.143   <2e-16 ***
## grade3:lat    -5.347e+04  1.279e+06  -0.042   0.9667    
## grade4:lat    -1.804e+05  1.001e+06  -0.180   0.8570    
## grade5:lat    -4.133e+04  9.303e+05  -0.044   0.9646    
## grade6:lat     8.265e+04  9.241e+05   0.089   0.9287    
## grade7:lat     1.904e+05  9.235e+05   0.206   0.8366    
## grade8:lat     2.679e+05  9.236e+05   0.290   0.7717    
## grade9:lat     6.011e+05  9.240e+05   0.651   0.5154    
## grade10:lat    8.964e+05  9.255e+05   0.969   0.3328    
## grade11:lat    9.230e+05  9.320e+05   0.990   0.3220    
## grade12:lat    2.283e+06  9.663e+05   2.362   0.0182 *  
## grade13:lat           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216500 on 21585 degrees of freedom
## Multiple R-squared:  0.6527, Adjusted R-squared:  0.6523 
## F-statistic:  1503 on 27 and 21585 DF,  p-value: < 2.2e-16
mod5b <- lm(price ~ grade + lat + view + sqft_basement + grade:view, data = houses_tidy)
summary(mod5b)
## 
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + grade:view, 
##     data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2194474  -104235   -20452    69331  5114365 
## 
## Coefficients: (11 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.858e+07  5.514e+05 -51.825  < 2e-16 ***
## grade3         1.822e+05  2.480e+05   0.735 0.462370    
## grade4         8.422e+04  2.188e+05   0.385 0.700344    
## grade5         1.122e+05  2.152e+05   0.521 0.602262    
## grade6         1.384e+05  2.148e+05   0.645 0.519252    
## grade7         2.086e+05  2.148e+05   0.971 0.331361    
## grade8         3.237e+05  2.148e+05   1.507 0.131773    
## grade9         5.378e+05  2.148e+05   2.504 0.012290 *  
## grade10        7.624e+05  2.149e+05   3.548 0.000389 ***
## grade11        1.069e+06  2.152e+05   4.969 6.77e-07 ***
## grade12        1.806e+06  2.172e+05   8.313  < 2e-16 ***
## grade13        2.763e+06  2.353e+05  11.745  < 2e-16 ***
## lat            6.042e+05  1.068e+04  56.547  < 2e-16 ***
## view1          1.034e+06  1.554e+05   6.657 2.87e-11 ***
## view2         -2.830e+05  1.568e+05  -1.805 0.071132 .  
## view3          2.136e+06  1.797e+05  11.883  < 2e-16 ***
## view4          1.357e+06  1.568e+05   8.652  < 2e-16 ***
## sqft_basement  1.163e+02  3.524e+00  33.002  < 2e-16 ***
## grade3:view1          NA         NA      NA       NA    
## grade4:view1  -9.306e+05  2.213e+05  -4.205 2.62e-05 ***
## grade5:view1  -1.003e+06  2.177e+05  -4.605 4.16e-06 ***
## grade6:view1  -9.869e+05  1.650e+05  -5.979 2.28e-09 ***
## grade7:view1  -9.138e+05  1.570e+05  -5.822 5.90e-09 ***
## grade8:view1  -8.973e+05  1.567e+05  -5.725 1.05e-08 ***
## grade9:view1  -8.906e+05  1.578e+05  -5.645 1.67e-08 ***
## grade10:view1 -8.115e+05  1.611e+05  -5.037 4.78e-07 ***
## grade11:view1 -5.043e+05  1.640e+05  -3.076 0.002104 ** 
## grade12:view1         NA         NA      NA       NA    
## grade13:view1         NA         NA      NA       NA    
## grade3:view2          NA         NA      NA       NA    
## grade4:view2   3.451e+05  2.692e+05   1.282 0.199908    
## grade5:view2   4.383e+05  1.802e+05   2.432 0.015033 *  
## grade6:view2   3.968e+05  1.604e+05   2.474 0.013350 *  
## grade7:view2   3.641e+05  1.575e+05   2.312 0.020806 *  
## grade8:view2   3.970e+05  1.573e+05   2.524 0.011620 *  
## grade9:view2   3.892e+05  1.577e+05   2.468 0.013584 *  
## grade10:view2  3.867e+05  1.584e+05   2.441 0.014654 *  
## grade11:view2  6.733e+05  1.599e+05   4.211 2.56e-05 ***
## grade12:view2  7.941e+04  1.709e+05   0.465 0.642233    
## grade13:view2         NA         NA      NA       NA    
## grade3:view3          NA         NA      NA       NA    
## grade4:view3          NA         NA      NA       NA    
## grade5:view3  -2.041e+06  2.804e+05  -7.280 3.45e-13 ***
## grade6:view3  -1.971e+06  1.887e+05 -10.446  < 2e-16 ***
## grade7:view3  -2.027e+06  1.813e+05 -11.176  < 2e-16 ***
## grade8:view3  -1.954e+06  1.806e+05 -10.820  < 2e-16 ***
## grade9:view3  -1.940e+06  1.807e+05 -10.732  < 2e-16 ***
## grade10:view3 -1.791e+06  1.813e+05  -9.875  < 2e-16 ***
## grade11:view3 -1.870e+06  1.830e+05 -10.218  < 2e-16 ***
## grade12:view3 -2.662e+06  1.948e+05 -13.663  < 2e-16 ***
## grade13:view3         NA         NA      NA       NA    
## grade3:view4          NA         NA      NA       NA    
## grade4:view4          NA         NA      NA       NA    
## grade5:view4  -1.098e+06  1.906e+05  -5.759 8.57e-09 ***
## grade6:view4  -1.024e+06  1.687e+05  -6.069 1.31e-09 ***
## grade7:view4  -9.416e+05  1.615e+05  -5.829 5.65e-09 ***
## grade8:view4  -8.595e+05  1.588e+05  -5.411 6.33e-08 ***
## grade9:view4  -1.031e+06  1.589e+05  -6.485 9.05e-11 ***
## grade10:view4 -5.960e+05  1.592e+05  -3.742 0.000183 ***
## grade11:view4 -5.687e+05  1.616e+05  -3.520 0.000433 ***
## grade12:view4 -6.514e+05  1.667e+05  -3.909 9.31e-05 ***
## grade13:view4         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 214700 on 21562 degrees of freedom
## Multiple R-squared:  0.6587, Adjusted R-squared:  0.6579 
## F-statistic: 832.1 on 50 and 21562 DF,  p-value: < 2.2e-16
mod5c <- lm(price ~ grade + lat + view + sqft_basement + grade:sqft_basement, data = houses_tidy)
summary(mod5c)
## 
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + grade:sqft_basement, 
##     data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2261338  -102336   -20466    68443  4995869 
## 
## Coefficients: (2 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -2.901e+07  5.529e+05 -52.467  < 2e-16 ***
## grade3                 1.840e+05  2.486e+05   0.740 0.459091    
## grade4                 7.449e+04  2.191e+05   0.340 0.733841    
## grade5                 1.080e+05  2.157e+05   0.501 0.616656    
## grade6                 1.427e+05  2.153e+05   0.663 0.507414    
## grade7                 2.189e+05  2.153e+05   1.017 0.309241    
## grade8                 3.323e+05  2.153e+05   1.544 0.122673    
## grade9                 5.215e+05  2.153e+05   2.422 0.015444 *  
## grade10                7.604e+05  2.154e+05   3.530 0.000416 ***
## grade11                1.071e+06  2.157e+05   4.965 6.91e-07 ***
## grade12                1.602e+06  2.172e+05   7.375 1.71e-13 ***
## grade13                2.022e+06  2.352e+05   8.598  < 2e-16 ***
## lat                    6.133e+05  1.071e+04  57.241  < 2e-16 ***
## view1                  1.530e+05  1.202e+04  12.723  < 2e-16 ***
## view2                  1.062e+05  7.274e+03  14.600  < 2e-16 ***
## view3                  1.722e+05  9.991e+03  17.238  < 2e-16 ***
## view4                  5.077e+05  1.266e+04  40.109  < 2e-16 ***
## sqft_basement          9.238e+02  5.159e+01  17.906  < 2e-16 ***
## grade3:sqft_basement          NA         NA      NA       NA    
## grade4:sqft_basement  -6.202e+01  1.097e+03  -0.057 0.954902    
## grade5:sqft_basement  -8.016e+02  1.056e+02  -7.591 3.32e-14 ***
## grade6:sqft_basement  -8.565e+02  5.464e+01 -15.675  < 2e-16 ***
## grade7:sqft_basement  -8.514e+02  5.193e+01 -16.396  < 2e-16 ***
## grade8:sqft_basement  -8.351e+02  5.197e+01 -16.069  < 2e-16 ***
## grade9:sqft_basement  -7.699e+02  5.227e+01 -14.730  < 2e-16 ***
## grade10:sqft_basement -7.336e+02  5.271e+01 -13.918  < 2e-16 ***
## grade11:sqft_basement -6.381e+02  5.375e+01 -11.871  < 2e-16 ***
## grade12:sqft_basement -6.263e+02  5.601e+01 -11.182  < 2e-16 ***
## grade13:sqft_basement         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 215300 on 21586 degrees of freedom
## Multiple R-squared:  0.6566, Adjusted R-squared:  0.6562 
## F-statistic:  1588 on 26 and 21586 DF,  p-value: < 2.2e-16
mod5d <- lm(price ~ grade + lat + view + sqft_basement + lat:view, data = houses_tidy)
summary(mod5d)
## 
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + lat:view, 
##     data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1679367  -103252   -20586    69103  5277777 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.681e+07  5.752e+05 -46.605  < 2e-16 ***
## grade3         1.749e+05  2.508e+05   0.698 0.485405    
## grade4         7.832e+04  2.209e+05   0.355 0.722900    
## grade5         1.085e+05  2.176e+05   0.499 0.618094    
## grade6         1.376e+05  2.172e+05   0.634 0.526358    
## grade7         2.073e+05  2.172e+05   0.955 0.339806    
## grade8         3.231e+05  2.172e+05   1.488 0.136877    
## grade9         5.318e+05  2.172e+05   2.448 0.014358 *  
## grade10        7.862e+05  2.173e+05   3.619 0.000296 ***
## grade11        1.153e+06  2.174e+05   5.304 1.14e-07 ***
## grade12        1.731e+06  2.184e+05   7.925 2.40e-15 ***
## grade13        3.118e+06  2.254e+05  13.832  < 2e-16 ***
## lat            5.669e+05  1.121e+04  50.596  < 2e-16 ***
## view1         -2.281e+07  5.489e+06  -4.156 3.26e-05 ***
## view2         -1.762e+07  2.730e+06  -6.456 1.10e-10 ***
## view3         -2.067e+07  3.479e+06  -5.941 2.87e-09 ***
## view4         -5.281e+07  5.178e+06 -10.201  < 2e-16 ***
## sqft_basement  1.188e+02  3.544e+00  33.510  < 2e-16 ***
## lat:view1      4.828e+05  1.154e+05   4.185 2.87e-05 ***
## lat:view2      3.728e+05  5.739e+04   6.496 8.44e-11 ***
## lat:view3      4.387e+05  7.317e+04   5.996 2.06e-09 ***
## lat:view4      1.121e+06  1.088e+05  10.304  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 217200 on 21591 degrees of freedom
## Multiple R-squared:  0.6505, Adjusted R-squared:  0.6501 
## F-statistic:  1913 on 21 and 21591 DF,  p-value: < 2.2e-16
mod5e <- lm(price ~ grade + lat + view + sqft_basement + lat:sqft_basement, data = houses_tidy)
summary(mod5e)
## 
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + lat:sqft_basement, 
##     data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1700954  -101315   -19149    68149  5245821 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.484e+07  6.302e+05 -39.414  < 2e-16 ***
## grade3             1.668e+05  2.509e+05   0.665 0.506163    
## grade4             7.638e+04  2.210e+05   0.346 0.729639    
## grade5             1.057e+05  2.177e+05   0.486 0.627219    
## grade6             1.366e+05  2.173e+05   0.628 0.529733    
## grade7             2.071e+05  2.173e+05   0.953 0.340637    
## grade8             3.233e+05  2.173e+05   1.488 0.136800    
## grade9             5.323e+05  2.173e+05   2.449 0.014328 *  
## grade10            7.889e+05  2.174e+05   3.629 0.000285 ***
## grade11            1.159e+06  2.176e+05   5.325 1.02e-07 ***
## grade12            1.739e+06  2.185e+05   7.959 1.81e-15 ***
## grade13            3.132e+06  2.256e+05  13.884  < 2e-16 ***
## lat                5.256e+05  1.245e+04  42.228  < 2e-16 ***
## view1              1.554e+05  1.213e+04  12.812  < 2e-16 ***
## view2              1.095e+05  7.325e+03  14.952  < 2e-16 ***
## view3              1.907e+05  1.002e+04  19.027  < 2e-16 ***
## view4              5.325e+05  1.266e+04  42.068  < 2e-16 ***
## sqft_basement     -1.633e+04  1.298e+03 -12.578  < 2e-16 ***
## lat:sqft_basement  3.458e+02  2.729e+01  12.671  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 217300 on 21594 degrees of freedom
## Multiple R-squared:   0.65,  Adjusted R-squared:  0.6497 
## F-statistic:  2228 on 18 and 21594 DF,  p-value: < 2.2e-16
mod5f <- lm(price ~ grade + lat + view + sqft_basement + view:sqft_basement, data = houses_tidy)
summary(mod5f)
## 
## Call:
## lm(formula = price ~ grade + lat + view + sqft_basement + view:sqft_basement, 
##     data = houses_tidy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1765104  -103878   -19859    69921  5256041 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -2.863e+07  5.591e+05 -51.208  < 2e-16 ***
## grade3               1.825e+05  2.515e+05   0.725 0.468204    
## grade4               8.250e+04  2.216e+05   0.372 0.709627    
## grade5               1.097e+05  2.183e+05   0.503 0.615290    
## grade6               1.383e+05  2.179e+05   0.635 0.525565    
## grade7               2.090e+05  2.179e+05   0.959 0.337456    
## grade8               3.250e+05  2.179e+05   1.492 0.135731    
## grade9               5.329e+05  2.179e+05   2.446 0.014465 *  
## grade10              7.884e+05  2.179e+05   3.617 0.000298 ***
## grade11              1.155e+06  2.181e+05   5.296 1.20e-07 ***
## grade12              1.723e+06  2.191e+05   7.862 3.94e-15 ***
## grade13              3.135e+06  2.262e+05  13.863  < 2e-16 ***
## lat                  6.054e+05  1.083e+04  55.877  < 2e-16 ***
## view1                1.159e+05  1.886e+04   6.147 8.03e-10 ***
## view2                9.032e+04  1.026e+04   8.802  < 2e-16 ***
## view3                1.482e+05  1.498e+04   9.895  < 2e-16 ***
## view4                4.715e+05  1.819e+04  25.924  < 2e-16 ***
## sqft_basement        1.080e+02  3.960e+00  27.268  < 2e-16 ***
## view1:sqft_basement  7.193e+01  2.310e+01   3.113 0.001853 ** 
## view2:sqft_basement  3.896e+01  1.330e+01   2.928 0.003411 ** 
## view3:sqft_basement  6.562e+01  1.571e+01   4.176 2.98e-05 ***
## view4:sqft_basement  8.729e+01  1.652e+01   5.284 1.28e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 217800 on 21591 degrees of freedom
## Multiple R-squared:  0.6483, Adjusted R-squared:  0.6479 
## F-statistic:  1895 on 21 and 21591 DF,  p-value: < 2.2e-16
# mod5c looks like the best
par(mfrow = c(2,2))
plot(mod5c)
## Warning: not plotting observations with leverage one:
##   8620, 19453

It seems that the grade:sqft_basement interaction leads to highest \(r^2\) (but two levels of the interaction cannot be determined due to fitting problems).

Now let’s see a visualisation of the effect of this interaction.

houses_resid %>%
  ggplot(aes(x = sqft_basement, y = resid, colour = grade)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ grade)
## `geom_smooth()` using formula 'y ~ x'

Relative importance of predictors:

library(relaimpo)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: boot
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
## 
##     aml
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
## Loading required package: mitools
## This is the global version of package relaimpo.
## If you are a non-US user, a version with the interesting additional metric pmvd is available
## from Ulrike Groempings web site at prof.beuth-hochschule.de/groemping.
calc.relimp(mod4a, method = "lmg", rela = TRUE)
## Response variable: price 
## Total response variance: 134782378397 
## Analysis based on 21613 observations 
## 
## 17 Regressors: 
## Some regressors combined in groups: 
##         Group  grade : grade3 grade4 grade5 grade6 grade7 grade8 grade9 grade10 grade11 grade12 grade13 
##         Group  view : view1 view2 view3 view4 
## 
##  Relative importance of 4 (groups of) regressors assessed: 
##  grade view lat sqft_basement 
##  
## Proportion of variance explained by model: 64.74%
## Metrics are normalized to sum to 100% (rela=TRUE). 
## 
## Relative importance metrics: 
## 
##                     lmg
## grade         0.6613505
## view          0.1475548
## lat           0.1082255
## sqft_basement 0.0828692
## 
## Average coefficients for different model sizes: 
## 
##                     1group      2groups      3groups      4groups
## grade3          63666.6672  105085.5034  143973.4826  182193.0397
## grade4          72381.0351   73237.4679   75552.2416   78956.0945
## grade5         106523.9716  105536.1415  105761.2601  107192.1633
## grade6         159919.6380  149061.7309  141055.2803  135919.3905
## grade7         260590.2629  236127.3021  217651.9965  205378.9824
## grade8         400852.7662  366830.4551  340559.3902  321796.9099
## grade9         631513.1864  589253.7503  555932.0020  530680.8349
## grade10        929771.0746  870361.6477  823031.9018  786434.2433
## grade11       1354841.7274 1272451.5902 1207113.0171 1156627.9552
## grade12       2049222.0006 1921648.1390 1819293.0741 1739316.4875
## grade13       3567615.3852 3394076.8577 3257787.9369 3156660.3522
## lat            813411.5832  720920.2912  657688.6172  603904.7514
## view1          315716.6452  250081.3021  198343.3721  156626.5028
## view2          295836.6957  222400.5323  160854.6835  107885.2600
## view3          475401.0824  364804.3448  271176.6966  189797.3285
## view4          967147.0461  802059.7321  660137.1873  534858.0263
## sqft_basement     268.6136     203.6304     154.3004     120.1816

It looks like the grade of property is the most important determiner of price, followed by how good the view of the property was.